Developing JSequitur to Study the Hierarchical Structure of Biological Sequences in a Grammatical Inference Framework of String Compression Algorithms

نویسندگان

  • Bulgan Galbadrakh
  • Kyung-Eun Lee
  • Hyun-Seok Park
چکیده

Grammatical inference methods are expected to find grammatical structures hidden in biological sequences. One hopes that studies of grammar serve as an appropriate tool for theory formation. Thus, we have developed JSequitur for automatically generating the grammatical structure of biological sequences in an inference framework of string compression algorithms. Our original motivation was to find any grammatical traits of several cancer genes that can be detected by string compression algorithms. Through this research, we could not find any meaningful unique traits of the cancer genes yet, but we could observe some interesting traits in regards to the relationship among gene length, similarity of sequences, the patterns of the generated grammar, and compression rate.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Comparative Study of String Dissimilarity Measures in Structural Clustering

This paper addresses structural clustering by stressing the distinction between string matching and structural resemblance. The hierarchical agglomerative clustering concept and a partitional approach are explored in a comparative study of several dissimilarity measures: minimum code length based measures; dissimilarity based on the concept of reduction in grammatical complexity; string matchin...

متن کامل

Bayesian Estimation of Parameters in the Exponentiated Gumbel Distribution

Abstract: The Exponentiated Gumbel (EG) distribution has been proposed to capture some aspects of the data that the Gumbel distribution fails to specify. In this paper, we estimate the EG's parameters in the Bayesian framework. We consider a 2-level hierarchical structure for prior distribution. As the posterior distributions do not admit a closed form, we do an approximated inference by using ...

متن کامل

An Introduction to Inference and Learning in Bayesian Networks

Bayesian networks (BNs) are modern tools for modeling phenomena in dynamic and static systems and are used in different subjects such as disease diagnosis, weather forecasting, decision making and clustering. A BN is a graphical-probabilistic model which represents causal relations among random variables and consists of a directed acyclic graph and a set of conditional probabilities. Structure...

متن کامل

Relation Between RNA Sequences, Structures, and Shapes via Variation Networks

Background: RNA plays key role in many aspects of biological processes and its tertiary structure is critical for its biological function. RNA secondary structure represents various significant portions of RNA tertiary structure. Since the biological function of RNA is concluded indirectly from its primary structure, it would be important to analyze the relations between the RNA sequences and t...

متن کامل

Mining Biological Repetitive Sequences Using Support Vector Machines and Fuzzy SVM

Structural repetitive subsequences are most important portion of biological sequences, which play crucial roles on corresponding sequence’s fold and functionality. Biggest class of the repetitive subsequences is “Transposable Elements” which has its own sub-classes upon contexts’ structures. Many researches have been performed to criticality determine the structure and function of repetitiv...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره 10  شماره 

صفحات  -

تاریخ انتشار 2012